A Compression-based Algorithm for Chinese Word Segmentation
Authors
Abstract
Chinese is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe...
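To make the idea concrete: once each candidate word is assigned an encoding cost, the boundary placement that minimizes the total cost can be found with a simple dynamic program. The sketch below is illustrative only; the word_cost table, the UNKNOWN_COST penalty, and the four-character word limit are invented placeholders, not the adaptive compression model the paper builds on.

```python
import math

# Hypothetical unigram costs (negative log probabilities, in bits) for a toy
# lexicon; a real system would derive costs from a trained compression model.
word_cost = {"中": 6.0, "国": 6.5, "中国": 4.0, "人": 5.0, "民": 6.0, "人民": 4.5}
UNKNOWN_COST = 20.0  # penalty for a single character outside the lexicon

def segment(text):
    """Return the segmentation of `text` with the lowest total cost."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i] = cheapest cost of text[:i]
    back = [0] * (n + 1)            # back[i] = start of the word ending at i
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):   # limit words to 4 characters
            piece = text[j:i]
            cost = word_cost.get(piece, UNKNOWN_COST if len(piece) == 1 else math.inf)
            if best[j] + cost < best[i]:
                best[i] = best[j] + cost
                back[i] = j
    # Recover the word boundaries by walking the back-pointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("中国人民"))  # -> ['中国', '人民']
```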
Similar Resources

An improved MDL-based compression algorithm for unsupervised word segmentation
We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression. Our analysis shows that its objective function can be efficiently approximated using the negative empirical pointwise mutual information. The proposed extension improves the baseline performance in both efficiency and accuracy on a standard benchmark.
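The negative empirical pointwise mutual information mentioned above is straightforward to compute from corpus counts. A minimal sketch, with invented counts:

```python
import math

# Toy corpus statistics (made-up numbers for illustration): counts of two
# adjacent units x and y, of the bigram xy, and the corpus size N.
count_x, count_y, count_xy, N = 500, 300, 120, 100_000

# Empirical pointwise mutual information:
#   PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N
pmi = math.log(p_xy / (p_x * p_y))

# The approximation discussed above uses the *negative* empirical PMI
# as a score; lower values indicate a more plausible merge.
print(f"PMI = {pmi:.3f} nats, negative PMI = {-pmi:.3f}")
```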
Using Directed Graph Based BDMM Algorithm for Chinese Word Segmentation
Word segmentation is a key problem for Chinese text analysis. In this paper, considering both the word-coverage rate and the sentence-coverage rate, a character directed graph with ambiguity marks is designed on top of the classic Bi-Directed Maximum Match (BDMM) segmentation method to search multiple possible segmentation sequences. This method is compared with the classic Maximum Matc...
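For background, the two greedy passes that any bi-directional maximum-match method compares can be sketched as follows. The dictionary here is a toy placeholder, and the paper's directed-graph construction for marking ambiguities is not reproduced.

```python
# A simplified sketch of forward and backward maximum matching, the two
# passes that bi-directional methods such as BDMM compare.
DICT = {"中", "国", "中国", "人", "民", "人民"}  # toy dictionary
MAX_LEN = 4  # longest dictionary entry considered

def forward_mm(text):
    words, i = [], 0
    while i < len(text):
        # Greedily take the longest dictionary word starting at i.
        for k in range(min(MAX_LEN, len(text) - i), 0, -1):
            if text[i:i + k] in DICT or k == 1:
                words.append(text[i:i + k])
                i += k
                break
    return words

def backward_mm(text):
    words, j = [], len(text)
    while j > 0:
        # Greedily take the longest dictionary word ending at j.
        for k in range(min(MAX_LEN, j), 0, -1):
            if text[j - k:j] in DICT or k == 1:
                words.append(text[j - k:j])
                j -= k
                break
    return list(reversed(words))

text = "中国人民"
print(forward_mm(text), backward_mm(text))  # both: ['中国', '人民']
```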
Chinese Segmentation with a Word-Based Perceptron Algorithm
Standard approaches to Chinese word segmentation treat the problem as a tagging task, assigning labels to the characters in the sequence indicating whether the character marks a word boundary. Discriminatively trained models based on local character features are used to make the tagging decisions, with Viterbi decoding finding the highest scoring segmentation. In this paper we propose an altern...
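The tagging formulation described in the opening sentences can be sketched as a small Viterbi decoder over 'B' (word-begin) and 'I' (word-internal) labels. The emission and transition scores below are invented stand-ins for a trained model's character features, not the paper's proposed word-based alternative.

```python
# Viterbi decoding over B/I character tags, the standard formulation the
# abstract describes. All scores are made-up numbers for illustration.
TAGS = ("B", "I")
transition = {("B", "B"): 0.2, ("B", "I"): 0.8,
              ("I", "B"): 0.7, ("I", "I"): 0.3}

def viterbi(emission_scores):
    """emission_scores: list of {tag: score} dicts, one per character."""
    n = len(emission_scores)
    best = [{t: emission_scores[0][t] for t in TAGS}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[i - 1][p] + transition[(p, t)])
            best[i][t] = best[i - 1][prev] + transition[(prev, t)] + emission_scores[i][t]
            back[i][t] = prev
    # Trace back the highest-scoring label sequence.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return list(reversed(path))

# Toy scores for a four-character sentence with words starting at 0 and 2.
scores = [{"B": 2.0, "I": 0.0}, {"B": 0.1, "I": 1.5},
          {"B": 1.8, "I": 0.2}, {"B": 0.1, "I": 1.4}]
print(viterbi(scores))  # -> ['B', 'I', 'B', 'I']
```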
A Stochastic Finite-State Word-Segmentation Algorithm for Chinese
The initial stage of text analysis for any NLP task usually involves the tokenization of the input into words. For languages like English one can assume, to a first approximation, that word boundaries are given by whitespace or punctuation. In various Asian languages, including Chinese, on the other hand, whitespace is never used to delimit words, so one must resort to lexical information to "r...
Journal
Journal Title: Computational Linguistics
Year: 2000
ISSN: 0891-2017, 1530-9312
DOI: 10.1162/089120100561746